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Abstract 

We present an approach to molecular-dynamics simulations of ferrofluids on graphics processing units (GPUs). Our 
numerical scheme is based on a GPU-oriented modification of the Barnes-Hut (BH) algorithm designed to increase the 
parallelism of computations. For an ensemble consisting of one million of ferromagnetic particles, the performance 
of the proposed algorithm on a Tesla M2050 GPU demonstrated a computational-time speed-up of four order of 
magnitude compared to the performance of the sequential All-Pairs (AP) algorithm on a single-core CPU, and two 
order of magnitude compared to the performance of the optimized AP algorithm on the GPU. The accuracy of the 
scheme is corroborated by comparing the results of numerical simulations with theoretical predictions. 
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1. Introduction 

Ferrofluids are media composed of magnetic nanoparticles of diameters in the range 10h-50 nm which are dispersed 
in a viscous fluid (for example, water or ethylene glycol) [1 1. These physical systems combine the basic properties 
of liquids, i.e. a viscosity and the presence of surface tension, and those of ferromagnetic solids possessing, i.e. 
an internal magnetization and high permeability to magnetic fields. This synergetic duality makes ferrofluids an 
attractive candidate for performing different tasks, ranging from the delivery of rocket fuel into spacecraft's thrust 
chambers under zero-gravity conditions to the high-precision drug delivery during cancer therapy [2|. Moreover, 
ferrofluids have already found their way into commercial applications as EM-controlled shock absorbers, dynamic 
seals in engines and computer hard-drives, and as key elements of high-quality loudspeakers, to name but a few |3]. 

Being liquids, ferrofluids can be modeled by using different macroscopic, continuum-media approaches. The cor- 
responding field, ferrohydrodynamics [1|, is a well-established area that has produced a lot of fundamental results. 
However, some important phenomena such as magnetoviscosity [4, 5 1 cannot be described properly neither on the hy- 
drodynamical level nor within the single-particle picture [ 6 1 . The role of multiparticle aggregates - chains and clusters 
- living in the ferrofluid bulk, is important there. The evaluation of other non-Newtonian features of ferrofluids, which 
can be controlled by external magnetic fields [5|, also demands information about structure of aggregates and their 
dynamics. The key mechanism responsible for the formation of ferro-clusters is the dipole-dipole interaction acting 
between magnetic particles. Under certain conditions, the interaction effects can overcome the destructive effects of 
thermal fluctuations and contribute tangibly to the ensemble dynamics. Therefore, in order to get a deeper insight into 
the above-mentioned features of ferrofluids, the interaction effects should be explicitly included into the model. 

Strictly speaking, the dipole-dipole interaction acts between all the possible pairs of N ferromagnetic particles. 
Therefore, the computational time of the straightforward sequential algorithm scales like A^ 2 . This is a fundamental 
drawback of many-body simulations and standard, CPU-based computational resources often limit the scales of the 
molecular-dynamics calculations. In order to run larger ensembles for longer times, researchers either (i) advance their 
models and numerical schemes and/or (ii) rely on more efficient computers. The first track led to different methods 
developed to explore the equilibrium properties of bulk ferrofluids. Most prominent are cut-off sphere approximations 
Q and the Ewald summation technique |8|, see also Ref. |9| for different modifications of the technique. However, 
both methods are of limited use for computational studies of confined ferrofluids, a problem that attracts special 
attention nowadays due to its practical relevance ifTOlfTTI . An alternative approach is the search for ways to increase 
scalability and parallelism of computations, see Ref. Ifl2l . Until recently, this practically meant the use of either large 
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computational clusters, consisting of many Central Processing Units (CPUs), or massively parallel supercomputers, 
like Blue Gene supercomputers. The prices and the maintenance costs of such devices are high. The advent of 
the general-purpose computing on graphics processing units (GP 2 U) lfl3ll has changed the situation drastically and 
boosted simulations of many -body systems onto new level 1 14 1. GPUs, initially designed to serve as the data pipelines 
for graphical information, are relatively small, much easier to maintain, and possess high computation capabilities 
allowing for parallel data processing. Nowadays, the scientific GPU-computing is used in many areas of computational 
physics, thanks to the Compute Unified Device Architecture (CUDA) developed by NVIDIA Corporation flT3]| . CUDA 
significantly simplifies GPU-based calculations so now one could use a Sony PlayStation 3 as a multi-core computer 
by programming it with C, C + + or Fortran [ 16] languages. 

The typical scale of molecular-dynamics simulations in ferrofluid studies is within the range N = 10 2 — 10 3 
lfT71 [181 . while N - 1Q 4 constitutes the current limit |[T9l . It is evident that the increase of the ensemble size by 
several orders of magnitude would tangibly improve the statistical sampling and thus the quality of simulation results. 
To run numerically 10 5 -r 10 6 interacting magnetic particles is a task not much different from the running of an 
artificial Universe l20l . Therefore, similar to the case of Computational Cosmology ll2Tll22ll23l . the GP 2 U seems to 
be very promising also in the context of ferrofluid simulations. In this paper we report the performance of a recently 
proposed GPU-oriented modification of the Barnes-Hut algorithm [24 1 used to simulate the dynamics ofN= 10 3 -r 10 6 
interacting ferromagnetic particles moving in a viscous medium. 

Although being mentioned as a potentially promising approach (see, for example, Ref. l25l ). the Barnes-Hut 
algorithm was never used in molecular-dynamics ferrofluid simulations, to the best of our knowledge. Our main goal 
here is to demonstrate a sizable speed-up one can reach when modeling such systems on a GPU - and by using 
GPU-oriented numerical algorithms - compared to the performance of conventional, CPU-based algorithms. We also 
demonstrate the high accuracy of the Barnes-Hut approximation by using several benchmarks. The paper is organized 
as follows: First, in Section II we specify the model. Then, in Section III, we describe how both, the All-Pairs and 
Barn-Hut algorithms, can be efficiently implemented for GPU-based simulations. The results of numerical tests are 
presented and discussed in Section V. Finally, Section VI contains conclusions. 



2. The model 

The model system represents an ensemble of N identical particles of the radius R, made of a ferromagnetic material 
of density D and specific magnetization /i. Each particle occupies volume V = ^nR 3 , has magnetic moment m = m(t) 
of constant magnitude \m\ = m = V/j., mass M = VD, and moment of inertia / = ^MR 2 . The ensemble is dispersed in 
a liquid of viscosity rj. Based on the Langevin dynamics approach, the equations of motions for k-th nanoparticle can 
be written in the following form ifTTl 

W k = N kz - G r 8k + Eg, (1) 
I<p k = -N kx sin <p k + N ky cos <p k - G r ip k + E£, (2) 

Mf k = ^(^k)-H k + F s k r -GA + ^ d f , (3) 

where 9 and <p are the polar and azimuthal angles of the magnetization vector in respectively. N kx = m ky H kz — 
m kz H ky ,N ky = m kz H kx - m kx H kz ,N kz = m ky H kx - m kx H ky , x,y,z denotes the Cartesian coordinates, dots over the 
variables denote the derivatives with respect to time. r k is the radius-vector defining the nanoparticle position, and 
the gradient is given by V* = Jr = ^xj^ + £y~$z + ^z~§r k , (^x,e y ,e z are the unit vectors of the Cartesian coordinates). 
Constants G, - 6nrjR, and G r = 8miR 3 specify translational and rotational friction coefficients, /j.o = 4-n ■ 10~ 7 H/m is 
the magnetic constant. 

The resulting field acting on the k-th particle H k is the sum of external field H ext and overall field exerted on the 
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particle by the rest of the ensemble, 
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where r k j — r k - fj. F s k r in Eq. [^denotes the force induced by a short-range interaction potential. In this paper we use 



Lennard-Jones potential [17| (though hard sphere 
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soft sphere |27] and Yukawa-type [28 1 potentials can be used 
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Here E is the depth of the potential well and s is the equilibrium distance at which the inter-particle force vanishes. 
The interaction between a particle and container walls is also modeled with a Lennard-Jones potential of the same 
type. The random-force vector, representing the interaction of a particle with thermal bath, has standard white-noise 
components, (Z r a (t)) = (a = <p,ff), (3Z(f)) — (y8 = x,y,z), and second moments satisfying (n. ra (t)a r p(f)) = 
3k B TG r 6 af3 6(t - f) \a,P = <p,ff) (E da (t)E df3 (t')} = 2k B TG,5 afi 5{t - f) (a,J3 = x,y,z) ED- Here k B is the Boltzmann 
constant and T is the temperature of the heat bath. 



By rescaling the variables, r = t/T c h,(T c i, = R/p ^3D/4po), u 
motions for the k-th particle can be rewritten in the reduced form, 
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Here % ext = 3fl ex ' /4np, T t = G,T ch /VD,r r = G r T ch /VR 2 D,cr = s/R,s = ET ch 2 /VRD,p kj = p k ~Pj 

Random-force vector components are given now by white Gaussian noises, with the second moments satisfying 
<&(t)§(t')> = 3k B Tr r 6 a/ ,6(T-T')/T ch (a,{3 = <p,6) <^(r)^(r')> = 2k B Tr t 6 afi 6(T-T')/T ch (a,{3 = x,y,z). In general 
case the characteristic relaxation time of particle magnetic moments to their equilibrium orientations is much smaller 
than T c i,, and one can neglect the magnetization dynamics by assuming that the direction of the vector m k coincides 
with the easy axis of the k-th particle. We assume that this condition holds for our model. 
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3. Two approaches to many-body simulations on GPUs: the All-Pairs and the Barnes-Hut algorithms 

In this section we discuss two alternative approaches to the numerical propagation of the dynamical system given 
by Eqs. ([v] - [9J on a GPU. It is assumed that the reader is familiar with the basics of the GP 2 U, otherwise we address 
him to Refs. (|29 30|) that contain crash-course-like introductions into the physically-oriented GPU computing. 

3.1. All-Pair Algorithm 

The most straightforward approach to propagate a system of N interacting particles is to account for the inter- 
actions between all pairs. Although exact, the corresponding All-Pairs (AP) algorithm is slow when performed on 
a CPU, so it is usually used to propagate systems of N - 10 2 -r 10 3 particles. However, even with this brute-force 
method one can tangibly benefit from GPU computations by noticing that the AP idea fits CUDA architecture IT3D . 

One integration step of the standard AP algorithm is performed in two stages. They are: 

1. Calculation of increment of particle positions and magnetic moments directions. 

2. Update particle positions and magnetic moments directions. 

This structure remains intact in the GPU version of the algorithm. Kernels responsible for the first stage compute 
forces that act on the particles, and calculate the corresponding increments for particle positions and magnetic mo- 
ments, according to equations (|7]-[9]i. The increments are then written into the global memory. Finally, second-stage 
kernels update particle states with the obtained increments. There is, however, a need for the global synchronization 
of the threads that belong to the different blocks after every stage since the stages are performed on separate CUDA 
kernels and all the information about the state of the system is kept in the shared memory. 

Each thread is responsible for one particle of the ensemble, and thus it should account for the forces exerted on 
the particle by the rest of the ensemble. To speed up the computational process we keep the data vector of the thread 
particle in the shared memory, as well as the information on other particles, needed to compute the corresponding 
interaction forces. Thus we have two sets of arrays of data in the shared memory, namely 

1 . data of the particles assigned to the threads of the block; 

2. data of particles to compute interaction with. 

The necessary data are the coordinates of the particles and projections of their magnetic moment vectors onto x, 
y and z axis. Size of the arrays are equal to the number of threads per block. The first set of the data is constant 
during one integration step, but the second set is changed. So, at the beginning we upload the information on particle 
coordinates and momenta to the second set of arrays in the shared memory. After computing the forces acting on the 
block particles, the procedure is repeated, i. e. the information on another set of particles is written into the second set 
of arrays and the corresponding interactions are computed. The corresponding pseudo-code is presented the below^] 

The advantage of the described approach is that it uses the global memory in the most optimal way. The access to 
global memory is coalesced and there are no shared memory bank conflicts. For an ensemble N = 10 4 particles this 
leads to the GPU occupancy 0of 97.7%. 

3.2. Barnes-Hut Algorithm 

The All-Pairs algorithm is simple and straightforward for implementation on a GPU and perfectly fits CUDA. Yet 
this algorithm is purely scalable. The corresponding computation time grows like 0(N 2 ), and its performance is very 
slow already for an ensemble of 10 5 particles. 

The Barnes-Hut (BH) approximation Il32l exhibits a much better scalability and its computational time grows like 
0(N log N). The key idea of the algorithm is to substitute a group of particles with a single pseudo-particle mimicking 
the action of the group. Then the force exerted by the group on the considered particle can be replaced with the force 
exerted by the pseudo-particle. In order to illustrate the idea assume that all particles are located in a three-dimensional 



'in the pseudo-code Ap x , Ap y , Ap z , A8, A<p denote increments of a particle's coordinates x, y, z, and the direction angles of particle magnetic 
moment, 8 and ip, needed to propagate the particle over one time step At. 

2 The parameter shows how the GPU is kept busy, and it is equal to the ratio of the number of active warps to the maximum number of warps 
supported on a multiprocessor of the GPU. It can obtained with CUDA Profiler. 
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Algorithm 1 A CUDA kernel that computes increments. 



cached k <— blockldx.x ■ blockDim.x + threadldx.x 
for j — to numberOf Particles do 

upload to the shared memory particle positions and angles 
e.g. x shared [threadldx.x] = x g i obal [ind], etc. 
for i — to blockDim.x do 

if j + i + k then > the condition to avoid particle on-self influence 

Calculate force and dipole field for the particle number j + i 
on current A;-th particle, and add the result to 
the total cached f di P, f sr , t dip . 
end if 
end for 

__syncthreads(); 
j <— j + blockDim.x 
end for 

Update cached Ap x , Ap y , Ap z , AO, Aip according to equations [7] -[9] 
Copy increments to global memory. 



cube, which is named 'main cell'. The main cell is divided then into eight sub-cells. Each sub-cell confines subset of 
particles. If there are more than one particle in the given sub-cell then the last is again divided into eight sub-cells. 
The procedure is re-iterated until there is only one particle or none left in each sub-cell. In this way we can obtain 
an octree with leaves that are either empty or contain single particle only. A simplified, two-dimensional realization 
of this algorithm is sketched with Fig. [T] By following this recipe, we can assign to every cell, obtained during 
the decomposition, a pseudo-particle, with magnetic moment u' equals to the sum of magnetization vectors it of all 
particles belonging to the cell, and position p' which is the position of the set geometric center, 

N' 

->, <=1 ft A\ 



where N' is a number of particles in the cell and p, is the position of the z'-th particle from the sub-set. 

Forces acting on £-th particle can be calculated by traversing the octree. If the distance from the particle to the 
pseudo-particle that corresponds to the root cell is large enough, the influence of this pseudo-particle on the &-th 
particle is calculated; otherwise pseudo-particles of the next sub-cells are checked etc (sometime this procedure can 
lead finally to a leaf with only one particle in the cell left). Thus calculated force is added then to the total force acting 
on the k-th particle. 

The BH algorithm allows for a high parallelism and it is widely employed in computational astrophysics problems 
Il33ll . However, implementation of the Barnes-Hut algorithm on GPUs remained a challenge until recently, because 
the procedure uses an irregular tree structure that does not fit the CUDA architecture well. It is the main reason why 
the BH scheme was not realized entirely on a GPU but some part of calculations was always delegated and performed 
on a CPU 1 34 35 1. A realization of the algorithm solely on a GPU has been proposed in 2011 |24|. Below we briefly 
outline the main idea and specify its difference from the standard, CPU-based realization. 

To build an octree on a CPU usually heap objects are used. These objects contain both child-pointer and data 
fields, and their children are dynamically allocated. To avoid time-consuming dynamic allocation and accesses to 
heap objects, an array-based data structure should be used. Since we have several arrays responsible for variables, the 
coalesced global memory access is possible. Particles and cells can have the same data fields, e.g. positions. In this 
case the same arrays are used. 

In contrast to the All-Pairs algorithm, where only two kernels were involved, in the original GPU-BH algorithm 
has six kernels |24l : 

1 . Bounding box definition kernel. 

2. Octree building kernel. 
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□ a cell with particles 
• a particle ° an empty cell 



Figure 1: Bames-Hut hierarchical decomposition in two-dimensional space and the corresponding octree. See Section 3.2 for more details. 

3. Computing geometrical center and total magnetic moment of each cell. 

4. Sorting of the particles with respect to their positions. 

5. Computing forces and fields acting on each particle. 

6. Integration kernel. 

Kernel 1 defines the boundaries of the root cell. Though the ensemble is confined to a container and particles 
cannot go outside, we keep this kernel. The size of the root cells can be significantly smaller than the characteristic 
size of the container. Moreover, the computation time of this kernel is very small, typically much less than 1% of 
the total time of one integration step. The idea of this kernel is to find minimum and maximum values of particle 
positions. Here we use atomic operations and built-in min and max functions. 

Kernel 2 performs hierarchical decomposition of the root cell and builds an octree in the three-dimensional case. 
As well as in following kernels, the particles are assigned to the threads in round-robin fashion. When a particle 
is assigned to a thread, it tries to lock the appropriate child pointer. In the case of success, the thread rewrites the 
child pointer and releases a lock. To perform a lightweight lock, that is used to avoid several threads access to the 
same part of the tree array, atomic operation should be involved. To synchronize the tree-building process we use the 
syncthreads barrier. 

Kernel 3 calculates magnetic moments and positions of pseudo-particles associated with cells by traversing un- 
balanced octree from the bottom up. A thread checks if magnetic moment and geometric center of all the sub-cells 
assigned to its cell have already been computed. If not then the thread updates the contribution of the ready cells and 
waits for the rest of the sub-cells. Otherwise the impact of all sub-cells is computed. 

Kernel 4 sorts particles in accordance to their locations. This step can speed up the performance of the next kernel 
due to the optimal global memory access. 

Kernel 5 first calculates forces acting on the particles, and then calculates the corresponding increments. Then,in 
order to compute the force and dipole field acting upon the particle, the octree is traversed. To minimize thread 
divergence, it is very important that spatially close particles belong to the same warp. In this case the threads within 
the warp would traverse approximately the same tree branches. This has already been provided by kernel 4. The 
necessary data to compute interaction are fetched to the shared memory by the first thread of a warp. This allows to 
reduce number of memory accesses. 

Finally, kernel 6 updates the state of the particles by using the position increments and re-orient particle magnetic 
moments. 

The above-described algorithm has many advantages. Among them are minimal thread divergence and the com- 
plete absence of GPU/CPU data transfer (aside of the transfer of the final results), optimal use of global memory with 
minimal number of accesses, data field re-use, minimal number of locks etc. All this allows to achieve a tangible 
speed-up. For more detailed description we direct the interested reader to Ref. [24 1. 

4. Results 

We performed simulations on (i) a PC with Intel Xeon x5670 @2.93GHz CPU(48 Gb RAM) and (ii) a Tesla 
M2050 GPU. Though the CPU has six cores, only a single core was used in simulations. The programs were compiled 
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N 


APcpu 


BHcpu 


APqpu 


BHcpu 


APcpu 
APnpn 


APcpu 
B Hr, pi i 


APgpv 
BHgpti 


10 3 


34 


9.8 


0.7 


2 


49 


17 


0.35 


10 4 


3 470 


137 


20 


6.5 


174 


534 


3.1 


10 s 


392 000 


2 487 


1 830 


54 


214 


7 259 


33.9 


10 6 


39 281 250 


73 121 


184 330 


621 


213 


63 214 


297 



Table 1: Duration of single integration step (ms) for the optimized All-Pairs algorithm implemented on CPU (APcpu) and GPU (APqpu), and for 
the CPU- (BH C pu) and GPU-oriented (BH G pu) Barnes-Hut algorithm. 



with nvcc (version 4.0) and gcc (version 4.4.1) compilers. Since there was no need in high-precision calculations, 
we used single-precision variables (float) and compiled the program with -use_f astjmath key. We also used -03 
optimization flag to speedup our programs. Finally, the Euler-Maruyama method with time step At = 0.001 was used 
to integrate Eqs. ([7]- [9). 

We measured the computation time of one integration step for both algorithms as functions of N. The results are 
presented with Table [T] The benefits of the GPU computing increase with the number of particles. For an ensemble 
of N — 10 6 particles the speed-up gained from the use of the Barnes-Hut algorithm is almost 300 compared to the 
performance of the performance of the optimized All-Pairs algorithm on the same GPU. However, for N = 10 3 the 
All-Pairs algorithm performs better. It is because the computational expenses for the tree -building phase, sorting etc., 
overweight the speed-up effect of the approximation for small number of particles. Here we remind that N — 10 3 was 
the typical scale of the most of ferrofluid simulations to date lfT71fT8l . 

Fig. |2]shows instantaneous configurations obtained during the simulations for a cubic confinement and for a mono- 
layer. To simulate a mono-layer of particles we use a rectangular parallelepiped of the height 2.1R as a confinement. 
The parameters of the simulations correspond to the regime when the average dipole energy is much larger than the 
energy of thermal fluctuations. The formation of chain-like large-scale clusters l36l is clearly visible. 

5. A benchmark test: average magnetization curves 

In order to check the accuracy of the numerical schemes we calculated the reduced magnetization curve 

N 

Reduced magnetization vector is given by the sum (m) = 1/N 2 «/■ The main parameters that characterize ferrofluid 

!=1 

magnetic properties are the dipole coupling constant A, which is the ratio of dipole-dipole potential and thermal energy, 
namely 1371 . 

A= W f , (15) 

and the volume fraction, which is the ratio between the volume occupied by particles and the total volume occupied 
by the ferrofluid, Vf, i.e., 

,= 1^.100%. (16) 

v f 

In the limit when A < 1 and for small volume fraction, <p <K 100%, the projection of the reduced magnetization vector 
on the direction of applied field, (ug), can well be approximated by the Langevin function ll36l : 

1 

{Uri) = L(a) = coth(a) - -, (17) 
a 

where a denotes the ratio between magnetic energy and thermal energy, 

a = mnoH/knT. (18) 
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Figure 2: Snapshots of iV-particle ensembles obtained with the Barnes-Hut algorithm: (a) N = 20 000 (cubic confinement with the edge length 
L = 150R) and (b) mono-layer of N = 300 particles. The parameters are: T r = r d = 0.1, T = 300 K,p = 3.1 ■ 10 5 A/m, D = 5000 kg/m 3 , R = 10 
run, 

We simulated a system with the parameters corresponding to maghemite (y - Fe2C>3) with a saturation magnetiza- 
tion of yU = 3. 1 • 10 5 A/m and density D = 5000 kg/m 3 , the carrier viscosity rj = 0.89 • 10~ 3 Pa (the latter corresponds to 
the water viscosity at T — 298 K). The volume fraction is set at <p = 1% and the particle radius R — 3 nm. The external 
magnetic field H was applied along z-axis. We initiated the system at time t — by randomly distributing particles in 
a cubic container. The orientation angles of particle magnetization vectors were obtained by drawing random values 
from the interval [0, 2jt]. 

Fig. [^presents the results of the simulations. After the transient T eq = 1000, given to the system of N = 10 5 
particles to equilibrate, the mean reduced magnetization has been calculated by averaging (u z ) over the time interval 
T ca ic = 1000. It is noteworthy that even single-run results are very close to the Langevin function, see Fig. [3ja). 
The contribution of the magnetostatic energy grows with a so that the strength of the dipole-dipole interaction is also 
increasing, see Fig. [4] 

Since the average value of dipole field projection on z axis is positive and increases with a, the dipole field 
amplifies the external magnetic field. This explains the discrepancy between the analytical and numerical results 
obtained for large values of a. For an ensemble of N = 10 3 particles the results obtained with two algorithms are near 
identical, Fig. [3](b). 
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Figure 3: Ch eckin g the Barnes-Hut approximation: a) comparison between the single-run results obtained with N = 10 5 particles and Langevin 
function, Eq. jl7| ; b) comparison of the results obtained with the All-Pair and the Barnes-Hut algorithms for an ensemble of N = 10 3 particles. 

It is important also to compare the average dipole fields, (h dlp ), calculated with the Barnes-Hut and the All-Pairs 
algorithms. The outputs obtained for the above-given set of parameters are shown on Figj4] Again, two algorithms 
produced almost identical results. 

0.04 



0.03 



^ 0.02 



0.01 



x Barnes-Hut 
♦ All-Pairs 



a 



Figure 4: z-component of the reduced average dipole field as a function of a 1181 . The parameters are the same as in Fig. [5Jb). Each point 1 
obtained by averaging over 10 3 independent realizations. 



6. Conclusions 

With this work we demonstrate that the Barnes-Hut algorithm can be efficiently implemented for large-scale, GPU- 
based ferrofluid simulations. Overall, we achieved a speed-up more than two orders of magnitude when compared 
to the performance of a common GPU-oriented All-Pairs algorithm. The proposed approach allows to increase the 
size of ensembles by two orders of magnitude compared to the present-day scale of simulations lfl9l . The Barnes-Hut 
algorithm correctly accounts for the dipole-dipole interaction within an ensemble of N = 10 6 particles and produces 
results that fit the theoretical predictions with high accuracy. Our finding opens several interesting perspectives. 
First, it brings about possibilities to perform large-scale molecular-dynamics simulations for the time evolution of 
ferrofluids placed in confinements of complex shapes like thin vessels or tangled pipes, where the boundary effects 
play an important role l38l . It is also possible to explore the non-equilibrium dynamics of ferrofluids, for example, 
their response to different types of externally applied magnetic fields, such as periodically alternating fields [39 1, 
or gradient fields ll40l . Another direction for further studies is the exploration of the relationship between shape 
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and topology of nano-clusters and different macroscopic properties of ferrofluids RTl l42l . Finally, the GPU-based 
computational algorithms provide with new possibilities to study heat transport processes in ferrofluids B31l . for 
example to investigate the performance of ferromagnetic particles as heat sources for magnetic fluid hyperthermia 

MM- 
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